Customer churn occurs when consumers or subscribers stop doing business with a company or service. Customers in the telecom industry can choose among a number of service providers and actively switch from one to the next; in this extremely competitive sector, annual churn rates run at 15-25 percent. Personalized customer retention is difficult because most businesses have a huge number of clients and cannot afford to spend much time on each of them: the revenue retained would not cover the cost of such broad outreach. But if a company could predict ahead of time which customers are likely to leave, it could focus its retention efforts only on these "high risk" customers.
The major goal of a customer churn prediction model is to proactively engage the customers who are most likely to churn. By offering, say, a gift certificate or a promotional price, the company can lock them in for another year or two and increase their lifetime value to the firm.
There are two broad ideas to understand here: predicting which customers are at risk of churning, and acting on those predictions with a retention plan.
To reduce customer churn, telecom companies need to predict which customers are at high risk of churning. To detect early signs of potential churn, one must first develop a holistic view of the customers and their interactions across numerous channels, including store/branch visits, product purchase histories, customer service calls, Web-based transactions, and social media interactions, to mention a few. Managing churn in this way helps these companies not only maintain their market position, but also expand and prosper: the more clients they have in their network, the lower the cost of initiation and the higher the profit. As a result, the company's primary focus for success is on lowering client attrition and developing a successful retention plan.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (roc_curve, accuracy_score, recall_score, confusion_matrix,
                             precision_score, f1_score, classification_report)
df = pd.read_csv('D:/Python/ML/WA_Fn-UseC_-Telco-Customer-Churn.csv')
df.head()
df.shape
df.info()
df.columns.values
df.dtypes
msno.matrix(df);
Using this matrix, we can quickly see the pattern of missingness in the dataset.
The visualization above shows no unusual pattern that sticks out; at first glance there is no missing data.
df = df.drop(['customerID'], axis = 1)
df.head()
df['TotalCharges'] = pd.to_numeric(df.TotalCharges, errors='coerce')
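Why `errors='coerce'` is needed: in the raw CSV, TotalCharges is stored as strings, and some rows contain a blank string that cannot be parsed as a number. A minimal sketch with toy values (not the real file) showing the coercion behaviour:

```python
import pandas as pd

# Toy stand-ins for TotalCharges strings; a blank entry cannot be parsed.
raw = pd.Series([' ', '29.85', '1889.5'])
converted = pd.to_numeric(raw, errors='coerce')  # unparseable values become NaN

print(converted.isna().sum())  # -> 1  (the blank string turned into NaN)
```

With `errors='raise'` (the default), the same blank string would abort the conversion with a ValueError instead.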
df.isnull().sum()
Here we see that TotalCharges has 11 missing values. Let's inspect those rows.
df[np.isnan(df['TotalCharges'])]
It can also be noted that the tenure column is 0 for these entries even though MonthlyCharges is not empty. Let's check whether there are any other 0 values in the tenure column.
df[df['tenure'] == 0].index
There are no additional 0 values in the tenure column. Let's delete the rows with tenure 0, since there are only 11 of them and removing them will not materially affect the data.
df.drop(labels=df[df['tenure'] == 0].index, axis=0, inplace=True)
df[df['tenure'] == 0].index
To handle any missing values remaining in the TotalCharges column, fill them with the mean of the TotalCharges values. Note that the result must be assigned back; a bare `df.fillna(...)` returns a new frame and leaves `df` unchanged.
df["TotalCharges"] = df["TotalCharges"].fillna(df["TotalCharges"].mean())
df.isnull().sum()
df["SeniorCitizen"]= df["SeniorCitizen"].map({0: "No", 1: "Yes"})
df.head()
df["InternetService"].describe()
numerical_cols = ['tenure', 'MonthlyCharges', 'TotalCharges']
df[numerical_cols].describe()
gender_counts = df['gender'].value_counts()
churn_counts = df['Churn'].value_counts()
# Create subplots: use 'domain' type for Pie subplots
fig = make_subplots(rows=1, cols=2, specs=[[{'type':'domain'}, {'type':'domain'}]])
# Take labels from value_counts itself so label and value order always match
fig.add_trace(go.Pie(labels=gender_counts.index, values=gender_counts.values, name="Gender"),
1, 1)
fig.add_trace(go.Pie(labels=churn_counts.index, values=churn_counts.values, name="Churn"),
1, 2)
# Use `hole` to create a donut-like pie chart
fig.update_traces(hole=.4, hoverinfo="label+percent+name", textfont_size=16)
fig.update_layout(
title_text="Gender and Churn Distributions",
# Add annotations in the center of the donut pies.
annotations=[dict(text='Gender', x=0.16, y=0.5, font_size=20, showarrow=False),
dict(text='Churn', x=0.84, y=0.5, font_size=20, showarrow=False)])
fig.show()
26.6% of customers switched to another firm. Customers are 49.5% female and 50.5% male.
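The overall churn rate can also be computed directly rather than read off a chart. A minimal sketch on a toy frame (illustrative rows, not the real Telco data):

```python
import pandas as pd

# Toy stand-in for the Telco frame: 2 churners out of 8 customers.
toy = pd.DataFrame({'Churn': ['Yes', 'No', 'No', 'Yes', 'No', 'No', 'No', 'No']})

# Fraction of rows where Churn == 'Yes'; the comparison yields booleans,
# and the mean of booleans is the proportion of True values.
churn_rate = (toy['Churn'] == 'Yes').mean()
print(f"{churn_rate:.1%}")  # -> 25.0%
```

On the real `df`, the same one-liner reproduces the 26.6% figure from the pie chart.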
df["Churn"][df["Churn"]=="No"].groupby(by=df["gender"]).count()
df["Churn"][df["Churn"]=="Yes"].groupby(by=df["gender"]).count()
plt.figure(figsize=(6, 6))
labels =["Churn: Yes","Churn:No"]
values = [1869,5163]
labels_gender = ["F","M","F","M"]
sizes_gender = [939,930 , 2544,2619]
colors = ['#ff6666', '#66b3ff']
colors_gender = ['#c2c2f0','#ffb3e6', '#c2c2f0','#ffb3e6']
explode = (0.3,0.3)
explode_gender = (0.1,0.1,0.1,0.1)
textprops = {"fontsize":15}
#Plot
plt.pie(values, labels=labels,autopct='%1.1f%%',pctdistance=1.08, labeldistance=0.8,colors=colors, startangle=90,frame=True, explode=explode,radius=10, textprops =textprops, counterclock = True, )
plt.pie(sizes_gender,labels=labels_gender,colors=colors_gender,startangle=90, explode=explode_gender,radius=7, textprops =textprops, counterclock = True, )
#Draw circle
centre_circle = plt.Circle((0,0),5,color='black', fc='white',linewidth=0)
fig = plt.gcf()
fig.gca().add_artist(centre_circle)
plt.title('Churn Distribution w.r.t Gender: Male(M), Female(F)', fontsize=15, y=1.1)
# show plot
plt.axis('equal')
plt.tight_layout()
plt.show()
There is a negligible difference in the percentage/count of customers who changed service provider. Both genders behaved in a similar fashion when it comes to migrating to another service provider/firm.
fig = px.histogram(df, x="Churn", color="Contract", barmode="group", title="<b>Customer contract distribution<b>")
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()
About 75% of customers with a month-to-month contract opted to move out, compared to 13% of customers with a one-year contract and 3% with a two-year contract.
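The per-contract churn rates behind the histogram can be tabulated with a row-normalised crosstab. A sketch on toy rows (illustrative values only, not the real distribution):

```python
import pandas as pd

# Toy rows: month-to-month churns half the time, longer contracts do not churn.
toy = pd.DataFrame({
    'Contract': ['Month-to-month', 'Month-to-month', 'One year', 'Two year'],
    'Churn':    ['Yes',            'No',             'No',       'No'],
})

# normalize='index' converts each row's counts into within-contract proportions.
rates = pd.crosstab(toy['Contract'], toy['Churn'], normalize='index')
print(rates)
```

Running `pd.crosstab(df['Contract'], df['Churn'], normalize='index')` on the real frame gives the exact churn rate per contract type.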
labels = df['PaymentMethod'].unique()
values = df['PaymentMethod'].value_counts()
fig = go.Figure(data=[go.Pie(labels=labels, values=values, hole=.3)])
fig.update_layout(title_text="<b>Payment Method Distribution</b>")
fig.show()
sns.set_context("paper",font_scale=1.1)
ax = sns.kdeplot(df.MonthlyCharges[df["Churn"] == 'No'],
color="Red", fill=True)  # shade= is deprecated; fill= is the current parameter
ax = sns.kdeplot(df.MonthlyCharges[df["Churn"] == 'Yes'],
ax=ax, color="Blue", fill=True)
ax.legend(["Not Churn","Churn"],loc='upper right');
ax.set_ylabel('Density');
ax.set_xlabel('Monthly Charges');
ax.set_title('Distribution of monthly charges by churn');
Customers with higher monthly charges are also more likely to churn.
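The same comparison can be made numerically with a groupby. A sketch on toy values (illustrative numbers, not from the real file):

```python
import pandas as pd

# Toy rows: churners pay more per month than non-churners.
toy = pd.DataFrame({
    'Churn':          ['Yes', 'Yes', 'No', 'No'],
    'MonthlyCharges': [95.0,  80.0,  45.0, 60.0],
})

# Average monthly charge within each churn group.
means = toy.groupby('Churn')['MonthlyCharges'].mean()
print(means)  # Yes -> 87.5, No -> 52.5
```

On the real `df`, `df.groupby('Churn')['MonthlyCharges'].mean()` quantifies the gap the density plot shows visually.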
fig = px.box(df, x='Churn', y = 'tenure')
# Update yaxis properties
fig.update_yaxes(title_text='Tenure (Months)', row=1, col=1)
# Update xaxis properties
fig.update_xaxes(title_text='Churn', row=1, col=1)
# Update size and title
fig.update_layout(autosize=True, width=750, height=600,
title_font=dict(size=25, family='Courier'),
title='<b>Tenure vs Churn</b>',
)
fig.show()
New customers (those with low tenure) are more likely to churn.
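The box plot's central tendency can be checked with a grouped median; churners typically sit at the low end of tenure. A sketch with toy tenures (illustrative only):

```python
import pandas as pd

# Toy rows: churners have short tenures, loyal customers long ones.
toy = pd.DataFrame({'Churn':  ['Yes', 'Yes', 'No', 'No', 'No'],
                    'tenure': [2,     5,     40,   60,   12]})

# Median tenure (in months) per churn group.
medians = toy.groupby('Churn')['tenure'].median()
print(medians)  # Yes -> 3.5, No -> 40.0
```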
def object_to_int(dataframe_series):
if dataframe_series.dtype=='object':
dataframe_series = LabelEncoder().fit_transform(dataframe_series)
return dataframe_series
df = df.apply(lambda x: object_to_int(x))
df.head()
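One detail worth knowing about `object_to_int`: LabelEncoder assigns integers in sorted order of the class labels, so for a Yes/No column, 'No' maps to 0 and 'Yes' to 1. A small sketch:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

s = pd.Series(['Yes', 'No', 'No', 'Yes'])

# fit_transform sorts the unique labels ('No' < 'Yes') and encodes accordingly.
encoded = LabelEncoder().fit_transform(s)
print(list(encoded))  # -> [1, 0, 0, 1]
```

This is convenient here because the target 'Churn' ends up as the usual 0/1 encoding.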
df.corr()['Churn'].sort_values(ascending = False)
X = df.drop(columns = ['Churn'])
y = df['Churn'].values
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.30, random_state = 40, stratify=y)
Since the numerical features are distributed over different value ranges, I will use StandardScaler to scale them down to the same range.
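A minimal sketch of the leakage-safe pattern used below, with toy numbers: the scaler's mean and standard deviation are fitted on the training split only, then reused to transform the test split.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_tr = np.array([[10.0], [20.0], [30.0]])  # toy stand-in for training values
X_te = np.array([[20.0]])                  # toy stand-in for test values

scaler = StandardScaler().fit(X_tr)        # statistics come from train only
print(scaler.transform(X_tr).mean())       # ~0 after scaling
print(scaler.transform(X_te))              # test scaled with the train statistics
```

Calling `fit_transform` on the test set instead would let test-set statistics leak into the preprocessing.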
def distplot(feature, frame, color='r'):
    plt.figure(figsize=(8,3))
    plt.title("Distribution for {}".format(feature))
    # sns.distplot is deprecated; histplot with kde=True is the modern equivalent
    ax = sns.histplot(frame[feature], color=color, kde=True)
num_cols = ["tenure", 'MonthlyCharges', 'TotalCharges']
for feat in num_cols: distplot(feat, df)
df_std = pd.DataFrame(StandardScaler().fit_transform(df[num_cols].astype('float64')),
                      columns=num_cols)
for feat in num_cols: distplot(feat, df_std, color='c')
# Divide the columns into 3 categories: one for standardisation, one for label encoding and one for one-hot encoding
# those that need one-hot encoding
cat_cols_ohe =['PaymentMethod', 'Contract', 'InternetService']
#those that need label encoding
cat_cols_le = list(set(X_train.columns)- set(num_cols) - set(cat_cols_ohe))
scaler= StandardScaler()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])
from sklearn import metrics
model_rf = RandomForestClassifier(n_estimators=500, oob_score=True, n_jobs=-1,
                                  random_state=50, max_features="sqrt",  # "auto" was removed in scikit-learn 1.3
                                  max_leaf_nodes=30)
model_rf.fit(X_train, y_train)
# Make predictions
prediction_test = model_rf.predict(X_test)
print (metrics.accuracy_score(y_test, prediction_test))
plt.figure(figsize=(4,3))
sns.heatmap(confusion_matrix(y_test, prediction_test),
annot=True,fmt = "d",linecolor="k",linewidths=3)
plt.title(" RANDOM FOREST CONFUSION MATRIX",fontsize=14)
plt.show()
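classification_report was imported earlier but not used; for an imbalanced target like churn, per-class precision and recall are more informative than raw accuracy. A sketch on toy labels showing how to unpack the confusion matrix and print the report:

```python
from sklearn.metrics import confusion_matrix, classification_report

y_true = [0, 0, 1, 1, 0, 1]  # toy ground truth (1 = churn)
y_hat  = [0, 0, 1, 0, 0, 1]  # toy predictions: one churner missed

# ravel() flattens the 2x2 matrix into (tn, fp, fn, tp).
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
print(tn, fp, fn, tp)  # -> 3 0 1 2

print(classification_report(y_true, y_hat))
```

On the real test split, `classification_report(y_test, prediction_test)` shows whether the random forest's recall on the churn class is acceptable.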
y_rfpred_prob = model_rf.predict_proba(X_test)[:,1]
fpr_rf, tpr_rf, thresholds = roc_curve(y_test, y_rfpred_prob)
plt.plot([0, 1], [0, 1], 'k--' )
plt.plot(fpr_rf, tpr_rf, label='Random Forest',color = "r")
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Random Forest ROC Curve',fontsize=16)
plt.show();
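The ROC curve can be summarised in a single number, the area under the curve. A sketch with toy probabilities (not the model's real scores):

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]                # toy labels
y_prob = [0.1, 0.4, 0.35, 0.8]       # toy predicted churn probabilities

# AUC is the probability that a random positive scores above a random negative.
auc = roc_auc_score(y_true, y_prob)
print(auc)  # -> 0.75
```

For the fitted model, `roc_auc_score(y_test, y_rfpred_prob)` gives the AUC of the curve plotted above.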
lr_model = LogisticRegression()
lr_model.fit(X_train,y_train)
accuracy_lr = lr_model.score(X_test,y_test)
print("Logistic Regression accuracy is :",accuracy_lr)
lr_pred= lr_model.predict(X_test)
plt.figure(figsize=(4,3))
sns.heatmap(confusion_matrix(y_test, lr_pred),
annot=True,fmt = "d",linecolor="k",linewidths=3)
plt.title("LOGISTIC REGRESSION CONFUSION MATRIX",fontsize=14)
plt.show()
y_pred_prob = lr_model.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
plt.plot([0, 1], [0, 1], 'k--' )
plt.plot(fpr, tpr, label='Logistic Regression',color = "r")
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Logistic Regression ROC Curve',fontsize=16)
plt.show();
dt_model = DecisionTreeClassifier()
dt_model.fit(X_train,y_train)
predictdt_y = dt_model.predict(X_test)
accuracy_dt = dt_model.score(X_test,y_test)
print("Decision Tree accuracy is :",accuracy_dt)
Let's now build and evaluate a final ensemble that combines the three models. With soft voting, the prediction comes from the averaged class probabilities of the individual classifiers rather than a simple majority of predicted labels.
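What soft voting does under the hood can be shown with toy numbers: average the per-class probabilities from each classifier, then pick the class with the highest average.

```python
import numpy as np

# Toy per-class probabilities [P(no churn), P(churn)] from three classifiers
# for a single sample (illustrative values only).
p_dt = np.array([0.4, 0.6])
p_lr = np.array([0.7, 0.3])
p_rf = np.array([0.2, 0.8])

avg = (p_dt + p_lr + p_rf) / 3   # soft voting: average the probabilities
pred = int(np.argmax(avg))       # predicted class = highest average probability
print(pred)  # -> 1
```

Note that hard (label) voting would give the same answer here (two of three classifiers predict churn), but the two schemes can disagree when one classifier is very confident.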
from sklearn.ensemble import VotingClassifier
clf1 = DecisionTreeClassifier()
clf2 = LogisticRegression()
clf3 = RandomForestClassifier()
eclf1 = VotingClassifier(estimators=[('dt', clf1), ('lr', clf2), ('rf', clf3)], voting='soft')
eclf1.fit(X_train, y_train)
predictions = eclf1.predict(X_test)
print("Final Accuracy Score ")
print(accuracy_score(y_test, predictions))
plt.figure(figsize=(4,3))
sns.heatmap(confusion_matrix(y_test, predictions),
annot=True,fmt = "d",linecolor="k",linewidths=3)
plt.title("FINAL CONFUSION MATRIX",fontsize=14)
plt.show()
Customer churn has a direct negative impact on a company's profitability, and a variety of techniques can be used to reduce it. A firm can prevent churn by getting to know its customers: identifying those who are likely to leave and working to improve their satisfaction. Improving customer service is, of course, a key lever in addressing this problem. Another way to decrease churn is to build customer loyalty through meaningful experiences and personalized service. Some businesses also survey customers who have already left to learn why, so they can take proactive measures to prevent future churn.